Word Sense Disambiguation: Why Statistics When We Have These Numbers?

Authors

  • Kavi Mahesh
  • Sergei Nirenburg
  • Stephen Beale
  • Boyan Onyshkevych
  • Evelyne Viegas
  • Victor Raskin

Abstract

Word sense disambiguation continues to be a difficult problem in machine translation (MT). Current methods either demand large amounts of corpus data and training or rely on knowledge of hard selectional constraints. In either case, the methods have been demonstrated only on a small scale and mostly in isolation, where disambiguation is a task by itself. It is not clear that the methods can be scaled up and integrated with other components of analysis and generation that constitute an end-to-end MT system. In this paper, we illustrate how the Mikrokosmos Knowledge-Based MT system disambiguates word senses in real-world texts with a very high degree of correctness. Disambiguation in Mikrokosmos is achieved by a combination of (i) a broad-coverage ontology with many selectional constraints per concept, (ii) a large computational-semantic lexicon grounded in the ontology, (iii) an optimized search algorithm for checking selectional constraints in the ontology, and (iv) an efficient control mechanism with near-linear processing complexity. Moreover, Mikrokosmos constructs complete meaning representations of an input text using the chosen word senses.

1 Word Sense Ambiguity

Word sense disambiguation continues to be a difficult problem for machine translation (MT) systems. The most common current methods for resolving word sense ambiguities are based on statistical collocations or static selectional preferences between pairs of word senses. The real power of word sense selection seems to lie in the ability to constrain the possible senses of a word based on selections made for other words in the local context. Although methods using selectional constraints and semantic networks have been delineated at least since Katz and Fodor (1963), computational models have not demonstrated the effectiveness of knowledge-based methods in resolving word senses in real-world texts on a large scale.
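The idea of constraining one word's senses by the senses chosen for its neighbors can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy ontology, the concept names, the lexicon entries, and the exhaustive search are hypothetical and are not the Mikrokosmos ontology, lexicon, optimized search, or control mechanism.

```python
from itertools import product

# Hypothetical toy ontology: child concept -> parent concept (is-a links).
IS_A = {
    "CORPORATION": "ORGANIZATION",
    "ORGANIZATION": "OBJECT",
    "HUMAN": "OBJECT",
    "LANGUAGE": "INFORMATION",
    "OBJECT": "ALL",
    "INFORMATION": "ALL",
}

def is_a(concept, ancestor):
    """Return True if `concept` equals `ancestor` or lies below it in IS_A."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False

# Hypothetical lexicon: each verb sense lists a selectional constraint per role.
VERB_SENSES = {
    "adquirir": {
        "ACQUIRE": {"agent": "OBJECT", "theme": "OBJECT"},
        "LEARN": {"agent": "HUMAN", "theme": "INFORMATION"},
    },
}

# Hypothetical noun entries: word -> candidate concepts.
NOUN_SENSES = {
    "grupo": ["ORGANIZATION"],
    "empresa": ["CORPORATION"],
    "estudiante": ["HUMAN"],
    "idioma": ["LANGUAGE"],
}

def disambiguate(verb, agent, theme):
    """Return every (verb sense, agent sense, theme sense) combination
    whose fillers satisfy the verb sense's selectional constraints."""
    consistent = []
    candidates = product(VERB_SENSES[verb].items(),
                         NOUN_SENSES[agent],
                         NOUN_SENSES[theme])
    for (sense, roles), agent_sense, theme_sense in candidates:
        if is_a(agent_sense, roles["agent"]) and is_a(theme_sense, roles["theme"]):
            consistent.append((sense, agent_sense, theme_sense))
    return consistent

print(disambiguate("adquirir", "grupo", "empresa"))
print(disambiguate("adquirir", "estudiante", "idioma"))
```

In this toy setting the noun fillers rule out one of the two verb senses in each call. The sketch enumerates all sense combinations, which grows exponentially with sentence length; the abstract states that Mikrokosmos instead relies on an optimized constraint search and a control mechanism with near-linear complexity, neither of which is reproduced here.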
This lack of large-scale demonstrations has resulted in a predominant shift of attention from knowledge-based to corpus-based, statistical methods for word sense resolution, despite the far greater potential of knowledge-based methods for advancing the development of large, practical, domain-independent NLP/MT systems. In this article, we illustrate how the semantic analyzer of the Mikrokosmos machine translation system resolves word sense ambiguities in real-world Spanish texts (news articles on company mergers and acquisitions from the EFE newswire) with a high degree of correctness. We begin by presenting the results from Mikrokosmos and then illustrate how they were obtained.

Note 1: See Guthrie et al. (1996) and Wilks et al. (1995) for recent surveys of related work.

                               Text #1  Text #2  Text #3  Text #4  Average
# words                            347      385      370      353      364
# words/sentence                  16.5     24.0     26.4     20.8     21.4
# open-class words                 183      167      177      177      176
# ambiguous open-class words        57       42       57       35       48
# resolved by syntax                21       19       20       12       18
total # correctly resolved          51       41       45       34       43
% correct                          97%      99%      93%      99%      97%

Table 1. Mikrokosmos Results in Disambiguating Open Class Words in Spanish Texts.
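The "% correct" row of Table 1 does not equal correctly resolved words divided by ambiguous words (51/57 is about 89%, not 97%). The reported figures are instead reproduced if the percentage is taken over all open-class words, counting unambiguous words as correct. That reading is an inference from the numbers, not something stated in this excerpt; the short check below, with hypothetical function names, shows the arithmetic.

```python
# Counts copied from Table 1: (open-class words, ambiguous words, correctly resolved).
TABLE = {
    "Text #1": (183, 57, 51),
    "Text #2": (167, 42, 41),
    "Text #3": (177, 57, 45),
    "Text #4": (177, 35, 34),
}

def percent_correct(open_class, ambiguous, resolved):
    """Accuracy over all open-class words, assuming (our inference) that
    every unambiguous open-class word counts as correct."""
    unambiguous = open_class - ambiguous
    return 100.0 * (unambiguous + resolved) / open_class

def percent_of_ambiguous(ambiguous, resolved):
    """Accuracy restricted to the ambiguous words only."""
    return 100.0 * resolved / ambiguous

for name, (open_class, ambiguous, resolved) in TABLE.items():
    overall = percent_correct(open_class, ambiguous, resolved)
    restricted = percent_of_ambiguous(ambiguous, resolved)
    print(f"{name}: {overall:.1f}% of open-class words, "
          f"{restricted:.1f}% of ambiguous words")
```

Rounded to whole percentages, the first figure on each line matches the table's 97%, 99%, 93%, and 99%.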

Similar resources

Word Sense Disambiguation of Ambiguous Persian Words with the LDA Topic Model

Word sense disambiguation is the task of identifying the correct sense of a word in a given context from a finite set of possible senses. In this paper, a model for Farsi word sense disambiguation is presented. The model uses two groups of features: first, all words and stop words around the target word, and second, topic models. We extract topics from a Farsi corpus with Latent Dirichlet ...

Utilizing corpus statistics for Hindi word sense disambiguation

Word Sense Disambiguation (WSD) is the task of computationally assigning the correct sense to a polysemous word in a given context. This paper compares three WSD algorithms for Hindi WSD based on corpus statistics. The first algorithm, called corpus-based Lesk, uses sense definitions and a sense-tagged training corpus to learn weights of Content Words (CWs). These weights are used in the disambig...

Word Sense Disambiguation in Roget's Thesaurus Using WordNet

We describe a simple method of disambiguating word senses in Roget's Thesaurus using information about the sense of the word in WordNet. We present a few variations on this method, compare their performance and discuss the results. We explain why this type of disambiguation can be useful.

On the Importance of Word Sense Disambiguation for Information Retrieval

Research in information retrieval has led to mixed results about the impact of natural language processing. This paper discusses the importance of word sense disambiguation despite these mixed results. We first discuss some of the factors that can cause apparent inconsistency in retrieval performance with regard to natural language processing: instability of test collection queries, different b...

Probabilistic word sense disambiguation

We present a theoretically motivated method for creating probabilistic word sense disambiguation (WSD) systems. The method works by composing multiple probabilistic components: such modularity is made possible by an application of Bayesian statistics and Lidstone’s smoothing method. We show that a probabilistic WSD system created along these lines is a strong competitor to state-of-the-art WSD ...

Publication date: 1997